Text File | 1996-02-28 | 91KB | 1,898 lines

Binary Data File Formats and Descriptions
-----------------------------------------

------------------------------------------------------------------------------

1. Introducing the netCDF and PDB formats
-----------------------------------------

Scientific computing now takes place in a network of high performance UNIX workstations and mainframes. Many sites include machines from several manufacturers on a single network. In such a world it is crucial that programs be portable -- that is, that a program be written in a language and in a style which enables it to be compiled and run on as wide a variety of machines as possible. It is just as crucial that the results of a portable program be written in a portable form -- that is, in a form legible by any machine on the network.

In order to save space and run fast, results of scientific calculations are written to disk in a binary format. Unlike text files, binary files are not guaranteed to make sense when read back by a machine other than the machine on which they were written. This is because the binary format used to represent numbers varies from one computer manufacturer to the next.

Several solutions to this problem have emerged in the past few years. Two existing solutions -- netCDF files and PDB files -- will be described in detail here. A third, HDF vertex set format, is not too different in kind, and will not be discussed. After a thorough examination of the strengths and weaknesses of these two formats, a "data language" is described which is capable of describing a very general binary data file -- including any netCDF or PDB file as a special case.

The netCDF format is simple and widely used, but its authors (Unidata, sponsored by NSF) do not describe the actual disk format of the data in the documentation that comes with the software. This is a peculiar omission, since no data format can be regarded as truly portable without being fully documented.
Furthermore, anyone can see that a heavily used data format cannot be changed in an incompatible way in the future -- any changes must take the form of additional structural information which does not conflict with the existing file format. As you will see, the netCDF format is easily flexible enough to allow such additions without affecting the basic intelligibility of a netCDF file.

The PDB file format is substantially more complicated and ambitious. It has been heavily used, but only at the Lawrence Livermore National Laboratory, where it was designed by Stewart Brown. PDB is a drastically smaller programming effort than netCDF, and it has not been cleaned up to remove the evidence of its evolution. Like netCDF, the underlying format of a PDB file is not described in the documentation for the package. Despite the historical imperfections remaining in the PDB format, its full disclosure is just as important as full disclosure of the netCDF format.

The documentation for both netCDF and PDB concentrates on the programming interface for reading and writing the files. This is certainly understandable, but any file format could be accessed by many interfaces, and any interface could be realized using many file formats. The file format is the data; the interface merely represents the data. Only a complete description of the binary file format actually used by netCDF and PDB files makes any discussions of the merits and demerits of either format intelligible.

In brief, a netCDF file consists of a small descriptive header, followed by the binary data described in the header. Both the header and the data are written to disk using the XDR (eXternal Data Representation) library invented by Sun. The XDR library converts integer and floating point numbers of various sizes into the IEEE standard floating point representation and big-endian order favored by Sun and many other computer manufacturers.
Each variable is of type byte, char (same as byte), short (2-byte integer), long (4-byte integer), float (4-byte real), or double (8-byte real), each may be an array of zero or more dimensions, and each may have any number of named attributes with associated attribute values. The variables may be divided into a "non-record" group, and a "record" group. "Record" variables are physically grouped at the end of the data section, and the layout of this subsection may be repeated an "unlimited" number of times to capture time-varying data.

Similarly, a PDB file begins with a very small header describing the primitive data formats and specifying the address of a longer descriptive section at the end of the file. The binary data itself follows the small header. Next comes a section in which any compound data types are named and defined in terms of the primitive data types (char, short, int, long, float, double, and pointered data). Next comes a symbol table, where a variable name, type, and dimension information are associated with a disk address in the data section. Finally, a special section allows for corrections to and extensions of the format not envisioned when the other descriptive components of the format were designed. (Effectively, the PDB format has been a research tool for studying various notions about portable binary files.)

The biggest difference between the netCDF and PDB strategies for data-portability is their handling of the primitive data formats. The netCDF strategy is to mandate the exact representation of numbers within the file. The PDB strategy is to describe the primitive data formats themselves, using a parameterization which is general enough to cover all currently interesting machines. Both strategies have strengths and weaknesses.
However, there are two advantages to the PDB strategy which deserve special mention: First, since PDB files can use the number representations native to any machine, they can be written and read at a similar speed on any platform; there is no advantage to owning a machine with number representations matching those of a Sun. Second, PDB files can always represent a floating point number exactly (to the last bit) as long as they are read on a machine with the same number representations as the one on which they were written. For a netCDF file, it is in principle possible that floating point data read back in will differ slightly from what was written out, owing to a round-trip conversion through XDR format. In practice, this is rarely a problem, at least for the number representations in common use in the current generation of machines.

Other differences arise because of the fact that the netCDF symbol table comes before the data, while the PDB symbol table comes afterward. The conflict here is between wanting to be able to add variables to the file after some of the variable data has already been written (PDB can, netCDF can't), and robustness in the sense of always having the data description present in case the code or machine dies after some, but not all, of the data has been written (netCDF is robust, PDB isn't).

Finally, there is a considerable difference between the kinds of objects easily described using a netCDF as opposed to a PDB format. To a certain degree, this is really a matter of the programming interface, which can obviously be built in such a way as to add a level of abstraction not explicitly represented in the underlying file format. Thus, netCDF has "attributes" associated with each variable, and makes a simple provision for handling time-varying data. Conversely, PDB allows for compound data types (like C structs), and a simple disk representation of pointers.
Despite all of these differences, one is struck by the common simple idea underlying both the netCDF and PDB file formats: The section of the file containing the binary data itself and the section which describes the binary data are completely separate. There is no formatting information mixed into the binary data. A second common design feature is that the symbol table information associates a variable name, type, and dimensions with the disk address where that variable begins. Because of this commonality, it is easy to use the PDB machinery to read netCDF files, or the netCDF machinery to read PDB files written with Sun-like primitive data formats, simply by providing an alternate routine to open the file and create the data structure the rest of the package uses to describe the file internally.

The common features of netCDF and PDB, as well as their respective strong points, motivate the design of the generic binary data description language described in the last section of this report. This language is christened "Clog" for Contents Log. A generic programming interface capable of reading either netCDF or PDB files can be based on Clog. Moreover, the data description language can make it possible to process the data in a completely arbitrary binary file with a single programming interface.

A portable scientific program which both reads its input from and writes its output to data-portable self-descriptive binary files allows for the most efficient use of a heterogeneous network of high performance computing engines. The netCDF and PDB file formats are both suitable choices for the required binary I/O files, although each has definite strengths and weaknesses. A general binary data description language is capable of describing either netCDF or PDB files, and can draw on the strengths of each paradigm.

------------------------------------------------------------------------------

2. The netCDF file format
-------------------------

A netCDF file consists of a shallow hierarchy of data types based on the primitive data types defined by the XDR standard. This standard is fully described in the documents released by Sun Microsystems:

   XDR: eXternal Data Representation Standard, RFC1014
   eXternal Data Representation: Sun Technical Notes
   XDR(3N) UNIX man page

   available on Internet by anonymous FTP to ftp.uu.net:
      /packages/bsd-sources/lib/librpc/doc/xdr.rfc.ms.Z
      /packages/bsd-sources/lib/librpc/doc/xdr.nts.ms.Z
      /packages/bsd-sources/lib/librpc/man/man3/xdr.3n.Z
   the man page is available online on many UNIX systems

The netCDF software itself is available on Internet by anonymous FTP to unidata.ucar.edu, in the file /pub/netcdf/netcdf.tar.Z. This includes complete documentation of the programming interface provided by Unidata to write and read netCDF files.

The primitive data types used in a netCDF file are:

   opaque     - any number of bytes, padded with 0's to a multiple of 4
   shorts     - any number of 2-byte integers in big-endian order, padded
                with 0's to a multiple of 4 bytes
                (implemented using the XDR_PUTBYTES and XDR_GETBYTES
                macros, 4 bytes at a time -- would have been cleaner and
                equivalent to implement using 4-byte opaque)
   int, enum  - one 4-byte integer in big-endian order
   long       - one signed 4-byte integer in big-endian order
   float      - one 4-byte IEEE floating point number in big-endian order
   double     - one 8-byte IEEE floating point number in big-endian order
   u_long     - one unsigned 4-byte integer in big-endian order

The netCDF file itself is at the opposite end of the type hierarchy, with the following intermediate layers:

   NC_array   - a counted list of objects of any other type
                The objects in an NC_array may be of variable length, so
                the total length of an NC_array must be calculated by
                summing the lengths of its elements.
   NC_var     - describes a variable in the file
                Each variable has a name, a data type (one of byte, char,
                short, long, float, or double), zero or more dimensions,
                zero or more attributes, and a disk address.
   NC_dim     - Dimensions in a netCDF file are all named and shared among
                all variables in the file. Each dimension has a name and a
                length.
   NC_attr    - Each attribute has a name and a value; the value can be
                zero or more objects of any of the primitive data types
                (byte, char, short, long, float, or double). Attributes
                can belong to one variable, or to the file as a whole.
   NC_iarray  - a u_long count followed by that many ints
   NC_string  - a u_long count followed by that many chars, written as an
                opaque

2A. Whole-file format
-----

The entire netCDF file has the following data structure:

   u_long            0x43444601  "CDF\001", netCDF file magic number
   u_long            numrecs     number of records
   NC_array NC_dim   dims        name<-->dimension length associations
   NC_array NC_attr  attrs       global attributes for this file
   NC_array NC_var   vars        description of variables in this file
   <any>             data        the variables described by vars

Here, the first column gives the type of the data identified in the second column, and described in the third column. Each object up to the data section is written immediately after the preceding item; to make sense of this part of the file, it must be read back sequentially -- that is, first the dims, then the attrs, then the vars. The vars, however, contain the disk addresses of all the variables in the data section, so the data in the file can be randomly accessed after vars has been read. Note that additional descriptive information could be added after vars without affecting the ability of the netCDF software to read the file.

2B. NC_array format
-----

   enum    type     0   unspecified
                    1   byte
                    2   char
                    3   short
                    4   long
                    5   float
                    6   double
                    7   bitfield (private)
                    8   string (private, NC_string)
                    9   iarray (private, NC_iarray)
                    10  dimension (private, NC_dim)
                    11  variable (private, NC_var)
                    12  attribute (private, NC_attr)
   u_long  count    number of objects in array
   <any>   objects  byte, char written as xdr_opaque of count bytes
                    short written as XDR_PUTBYTES of count shorts
                    (byte pairs)
                    all others written as a sequence of count objects

Note that the length of an NC_array is 8 bytes plus the aggregate length of the array elements.

2C. NC_var format
-----

   NC_string         name   name of the variable
   NC_iarray         assoc  list of 0-origin indices into the array of
                            dimensions (dims) for this file
                            The dimensions in the list are listed slowest
                            varying first. If the slowest dimension is the
                            UNLIMITED dimension, this is a record variable.
   NC_array NC_attr  attrs  attributes for this variable
   enum              type   data type for this variable (values as for
                            NC_array above)
   u_long            len    total number of bytes on disk
                            The length of a netCDF record is the sum of
                            the len fields of all record variables.
   u_long            begin  disk address
                            (This is calculated on the basis of the known
                            data lengths in the Unidata code, NOT obtained
                            from xdr_getpos.)

All non-record variables precede all record variables, to allow the block of record variables to be treated as an array of an indeterminate number of record structure instances.

2D. NC_dim format
-----

   NC_string  name  name associated with dimension
   long       size  number of elements along dimension

Note: A netCDF file may have zero or one UNLIMITED dimension, which is marked by size==0. If a variable has the UNLIMITED dimension, that must be its slowest varying dimension. Such variables are physically placed at the end of the data section, and numrecs copies of this "record section" exist at the end of the data section. (The record variables may occur anywhere in the vars list of variables for the file.)

2E. NC_attr format
-----

   NC_string       name  name of attribute
   NC_array <any>  data  value of attribute

Notes: The data must be one of the "public" types (byte, char, short, long, float, or double). An NC_attr written to the attrs array at the beginning of the netCDF file is a "global" attribute which applies to the whole file. An NC_attr written to the attrs array in an NC_var applies only to that variable.

2F. NC_string format
-----

   u_long  count   number of characters in string
   opaque  values  count characters

Note: The count does NOT include any trailing '\0' character; a count of 0 is interpreted as (NC_string *)0, NOT as a zero-length string.

2G. NC_iarray format
-----

   u_long  count   number of 4-byte ints in array
   int     values  count ints (written sequentially)

Note: The count can be zero.

------------------------------------------------------------------------------

3. The PDB file format
----------------------

The self-descriptive information in a netCDF file is stored using a variety of data types. Thus, a name is a u_long count followed by the actual characters as an opaque, a disk address is a u_long, and a data type is an enum. In contrast, the self-descriptive information in a PDB file is mostly character encoded, with certain characters set aside as delimiters of various sorts. Hence, a disk address or a size is converted to the characters of the equivalent decimal number in the PDB file self-description.

To describe such character encoded data, the following discussion adopts a notation based on the format argument to the standard C library routines printf and scanf. The meaning of this notation is as follows:

A quoted string represents a consecutive sequence of 8-bit bytes containing the ASCII representations for the characters in the string. Thus, "Hello" represents the five bytes 0x48, 0x65, 0x6c, 0x6c, 0x6f, in that order. The two characters "\" and "%" are exceptions to this rule.
A "\" in a format string introduces an escape sequence which represents a single non-printable ASCII character. The only escape sequences required for the following discussion are "\n", "\t", "\001", and "\002". These have the following meanings:

   "\n"    means a newline character, which can have any single one of
           the three values 0x0a (ASCII line feed), 0x0d (ASCII carriage
           return), or 0x1f (ASCII unit separator)
           This non-unique choice for a delimiter character is an
           inconvenient leftover from an early implementation of the
           original PDBLib programming interface.
   "\t"    means a tab character, 0x09.
   "\001"  means an ASCII SOH character, 0x01.
   "\002"  means an ASCII STX character, 0x02.

The "%" character in a format string introduces an escape sequence which represents the characters produced by converting data into a printable form. The only such format conversions required in the following discussion are:

   "%d"    means the decimal equivalent of an int value, that is, a
           sequence of digits possibly preceded by a minus "-".
   "%ld"   is the same thing for a long value
   "%s"    means zero or more characters in a null-terminated string of
           ASCII characters

Because the bytes 0x01, 0x02, 0x0a, 0x0d, and 0x1f ("\001", "\002", and "\n") are used as delimiters, these five characters may not occur in any string output with a "%s" in the PDB file format. A sixth character, 0x00, may not occur by the definition of "%s". As with the netCDF file format, particular programming interfaces to read and write PDB files may impose stricter limitations on the set of characters which are legal in variable names and data type names. A discussion of restrictions of this sort will follow the PDB file format description.

3A. Whole-file format
-----

   "!<<PDB:II>>!\n"  HeadTok      13 bytes of identification
   byte              count        byte count of prim_info + 1
                                  normally
                                  count= 24+sizeof(float)+sizeof(double)
   byte[count-1]     prim_info    parameterizations of short, int, long,
                                  float, double, and * primitive types,
                                  giving size, byte order, and floating
                                  point layout
   "%ld\001"         float_bias   exponent bias for float type
   "%ld\001\n"       double_bias  exponent bias for double type
   "%ld\001"         chart_addr   file byte address of structure chart
   "%ld\001\n"       symtab_addr  file byte address of symbol table
   <any>             data         the binary data
   PDB_chart         chart        the structure chart defining compound
                                  data types in terms of primitive data
                                  types and simpler compound types,
                                  begins at byte chart_addr of file
   PDB_symtab        symtab       the symbol table associating a variable
                                  name, type, and dimensions with a disk
                                  address, begins at byte symtab_addr of
                                  file
   PDB_extras        extras       corrections to and extensions of the
                                  PDB file format, begins at byte
                                  immediately following the symtab

The prim_info array is broken down as follows:

   byte[6]               sizeof(void *), sizeof(short), sizeof(int),
                         sizeof(long), sizeof(float), sizeof(double)
                         the number of bytes of six of the seven
                         predefined primitive data types
                         sizeof(char) is always 1 byte
   byte[3]               orderof(short), orderof(int), orderof(long)
                         the byte order of the three multibyte integer
                         data types, 1 if the most significant byte is
                         first, 2 if the least significant byte is first
   byte[sizeof(float)]   permutation of bytes for float type
   byte[sizeof(double)]  permutation of bytes for double type
   byte[7]               bitsof(float)
                         bit sizes and addresses of sign, exponent, and
                         mantissa in a float
   byte[7]               bitsof(double)
                         bit sizes and addresses of sign, exponent, and
                         mantissa in a double

The meaning of the permutation and bitsof(...) for the float and double types is fully described below in the section on PDB parameterization of floating point layout.

3B. PDB_chart format
-----

The structure chart consists of a series of structure definitions, representing the various compound data types used to describe the data in the file. Each structure definition begins with:

   "%s\001"   base_type_name  the name of the compound data type
   "%ld\001"  size            byte size of one instance of the data
                              structure on disk

The definition continues with one member descriptor per member of the data structure:

   "%s\001"   descriptor      basically "type name(dimensions)",
                              described in detail below

An array of 12 arrays of 5 doubles called "junk" would have the descriptor "double junk(12,5)".

An individual structure definition ends with a newline character:

   "\n"       end_def         end of structure definition is thus always
                              "\001\n"

The end of the entire structure chart is marked by:

   "\002\n"   end_chart       This may occur before any structure
                              definitions, in which case all of the
                              variables in the file must have one of the
                              primitive data types.

Data can be either an instance of one of the primitive data types, an instance of a compound data type, or a pointer to an object. The data type is specified as a <full_type> string, which has the format

   <full_type> is <ws><base_type_name><indirection_level>

where

   <ws>                 is zero or more of the whitespace characters
                        space or tab, that is " " or "\t"
   <indirection_level>  is zero or more asterisk or whitespace
                        characters, that is "*" or " " or "\t".

The level of indirection is the number of "*" characters; any <full_type> with a level of indirection greater than zero represents a pointer.
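
The <full_type> rule above can be illustrated with a short Python sketch. This is not part of PDBLib or of any existing reader; the function name parse_full_type is hypothetical, and the sketch assumes that a <base_type_name> contains no whitespace or "*" characters of its own (so the bare "*" primitive type name is not handled).

```python
def parse_full_type(full_type):
    """Split a PDB <full_type> string into (base_type_name, level).

    Hypothetical helper: assumes the base type name itself contains
    no space, tab, or asterisk characters.
    """
    # <ws>: skip any leading spaces or tabs
    text = full_type.lstrip(" \t")
    # <base_type_name>: runs up to the first whitespace or asterisk
    end = 0
    while end < len(text) and text[end] not in " \t*":
        end += 1
    base, tail = text[:end], text[end:]
    if not base:
        raise ValueError("missing base type name")
    # <indirection_level>: only "*", " ", and "\t" may remain
    if any(ch not in " \t*" for ch in tail):
        raise ValueError("unexpected characters after base type name")
    # the level of indirection is the number of "*" characters
    return base, tail.count("*")
```

For instance, parse_full_type("  char *") yields the base type "char" with indirection level 1, so it describes a pointer, while parse_full_type("double") has level 0 and describes plain data.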
The general format of a member descriptor is:

   <full_type><ws><member_name><dimlist>

where

   <ws>           is zero or more whitespace (" " or "\t") characters,
                  but at least one such character if the
                  <indirection_level> field of the <full_type> has zero
                  characters
   <full_type>    is the data type of this member; its <base_type_name>
                  is either a primitive data type name, or the name of a
                  previously defined compound data type,
   <member_name>  is the member name associated with this descriptor, and
   <dimlist>      is either zero characters (if the member is a scalar),
                  or <open_dimlist><dimlist_interior><close_dimlist>

where

   <open_dimlist>      is either "(" or "[",
   <close_dimlist>     is either ")" or "]", and
   <dimlist_interior>  is a comma "," delimited list of
                       either
                          "%ld"      length
                             number of elements along this dimension
                       or
                          "%ld:%ld"  origin, max_i
                             origin is a suggested minimum index value
                             along this dimension, and max_i is
                             origin+length-1, the maximum index value
                             along this dimension

The <dimlist_interior> may contain whitespace (" " or "\t") characters anywhere except within the "%ld" fields. For multidimensional lists, the dimensions are listed slowest varying first (but see "Major-Order" in PDB_extras below).

The names of the primitive data types are:

   "char"     same as C language char
   "short"    same as C language short
   "integer"  same as C language int
   "long"     same as C language long
   "float"    same as C language float
   "double"   same as C language double
   "*"        similar to void* in C language, but requires a pointee
              type when used as the type of a member or variable

3C. PDB_symtab format
-----

The symbol table consists of a series of variable definitions which specify the variable name, data type, dimensions, and disk address. Each definition has the following format:

   "%s\001"   name       the name of the variable
   "%s\001"   full_type  the full data type name
                         This is of the form <full_type> as described in
                         PDB_chart above.
   "%ld\001"  number     the total number of full_type objects, which is
                         1 or the product of the dimension lengths
   "%ld\001"  address    the byte address of the first byte of this data
                         in the file

The variable definition continues with one (origin, length) pair for each dimension associated with the variable. As for the dimensions in a dimension descriptor, the slowest varying dimension is listed first (but see "Major-Order" in PDB_extras below).

   "%ld\001"  origin     suggested minimum index value for this dimension
   "%ld\001"  length     number of elements along this dimension
                         (NOT maximum index value as in member descriptor)

The variable definition concludes with a newline:

   "\n"       end_def    end of variable definition is thus always
                         "\001\n"

The end of the entire symbol table is marked by a second consecutive newline:

   "\n"       end_symtab  end of symbol table is thus always "\n\n"
                          (unless it is empty, which is not a very
                          interesting case)

3D. PDB_extras format
-----

The extras section begins with the byte immediately following end_symtab, the end of the symbol table. The extras section consists of a sequence of extra blocks. Each extra block consists of a marker of the form:

   "%s:"      extra_id   name of the "extra"

followed by any amount of textual data (except for the "Alignment" extra_id, see below), and ending with:

   "\n"                  This is not necessarily the first "\n"
                         associated with the extra_id, but if the
                         extra_id was not recognized, the characters
                         following a "\n" are scanned for "%s:" before
                         the next "\n" to try to match a known extra_id

The end of the entire extras section is marked by a second consecutive newline:

   "\n"       end_extras

Unless the extras section is empty, it therefore ends with "\n\n".

The following extra_id names have meanings in version 7 PDB files: "Alignment", "Major-Order", "Primitive-Types", "Offset", "Version", and "Casts". These are listed in rough order of importance for interpreting the data in a PDB file.
Here are the formats for these extras blocks:

   "Alignment:"          begin block which gives the alignments of the
                         primitive data types within structures
   byte    char_align    alignment boundary for char
   byte    ptr_align     alignment boundary for pointers (*)
   byte    short_align   alignment boundary for short
   byte    int_align     alignment boundary for int
   byte    long_align    alignment boundary for long
   byte    float_align   alignment boundary for float
   byte    double_align  alignment boundary for double
   "\n"                  end block of alignments

Note: It would have been more consistent to print the alignments in ASCII as "%d\001". The 7-byte format shown above would present a problem if any of the alignments happened to be one of the ASCII newline characters. In practice, this doesn't ever happen, since the alignments are always powers of two.

The "Alignment" extra corrects a critical oversight in the PDB prim_info (see whole file format above) data. Namely, the byte offset of a structure member cannot be calculated without knowing whether the target machine/compiler places alignment restrictions on the various primitive data types. Therefore, until the "Alignment" extra has been read, no compound data type defined in the structure chart has a precise meaning. And yes, it is annoying that the extras section cannot be read before the symbol table, and the data types used in the symbol table are defined in the structure chart, and the structure chart cannot be interpreted without the "Alignment" extra.

Note that the disk address of a variable is NOT necessarily aligned to be a multiple of the alignment of its data type. Alignment applies only to the offset of a structure member from the beginning of a structure instance. The alignment of a structure member which is itself a compound data type is computed as the largest alignment of any of its own members.
(There exist machines and compilers for which this calculation is incorrect, but even on such machines, practical examples of structures which fail the simple alignment calculation required by the PDB file format are rare.) In practice, alignments are always powers of two, so the largest alignment of any member of a structure is also the least common multiple of all the member alignments.

   "Major-Order:"        begin Major-Order block
   "%d\n"  dim_order     "101" if first dimension varies slowest (default)
                         "102" if first dimension varies fastest

The "Major-Order" extra MUST be interpreted in order to make sense of the structure chart and symbol table, since it changes the meaning of the dimension lists in both structure member descriptors and variable definitions. If the "Major-Order" extra is not present, the default is that the first dimension listed is the slowest varying dimension, and the last dimension listed is the fastest varying. Thus, a structure member with the descriptor "int x(2,3)" and the default dim_order means that the six associated values are, in order, x(0,0), x(0,1), x(0,2), x(1,0), x(1,1), x(1,2). If the dim_order is 102, the same descriptor describes the six values in the order x(0,0), x(1,0), x(0,1), x(1,1), x(0,2), x(1,2). Again, it is an annoyance that the "Major-Order" is known only after the symbol table has been read. Unless the "Major-Order" extra is properly interpreted, the topology of multidimensional arrays in a PDB file will be wrong.

   "Primitive-Types:\n"  begin Primitive-Types block

The Primitive-Types block is an adjunct to the structure chart, which allows primitive data types other than char, short, int, long, float, and double to be defined. These primitive types can be used as the <base_type_name> either in the symbol table, or in a member descriptor in the structure chart, just like any other data type.
Each primitive type begins:

   "%s\001"         base_type_name  the name of the primitive data type
   "%ld\001"        size            byte size of one instance of the
                                    primitive type on disk
   "%d\001"         alignment       the alignment; the byte offset of a
                                    structure member of this data type
                                    will always be a multiple of this
   "%d\001"         order           1 if the most significant byte is
                                    first, 2 if the least significant
                                    byte is first, and -1 if f_flag is
                                    "NO-CONV" or "FLOAT"
   "%s\001"         p_flag          "ORDER" if a byte order permutation
                                    follows, else "DEFORDER"
   "%d\001"[size]   permutation     if p_flag=="ORDER", the permutation
                                    is listed as size ASCII numbers
                                    These are in the same order as the
                                    permutations in the prim_info above.
   "%s\001"         f_flag          "NO-CONV" if the data is opaque,
                                    "FIX" if the data should be
                                    transformed as an integer, and
                                    "FLOAT" if the data should be
                                    transformed as a floating point
   "%ld\001"[8]     fp_format       if f_flag=="FLOAT", the 8 numbers
                                    parameterizing the floating point
                                    layout are listed as ASCII numbers
                                    These are in the same order as the
                                    floating point descriptions in
                                    prim_info above, except that the
                                    exponent bias is added as an eighth
                                    element of fp_format.
   "\n"             end_primitive   ends the definition of this
                                    primitive data type

The end of the entire Primitive-Types block is marked by:

   "\002\n"

If there are no additional primitive types, the entire block is "Primitive-Types:\n\002\n"

Once again, the Primitive-Types information is required to interpret the meaning of the structure chart and symbol table, but cannot be read until after the symbol table.

   "Offset:%d\n"      default_origin
                      Specifies the default dimension origin for member
                      descriptor dimension lists in the structure chart.
                      The default default_origin is zero.

   "Version:%d|%s\n"  version, date
                      PDB version number (7 for the format described
                      here) and file creation date string

   "Casts:\n"         begin Casts block

The Casts block provides additional information about members of data structures which are pointers.
Specifically, another member of the data structure may be of type char *, and point to a string which is of the form <base_type_name><pointer_indicator>, which is the "true data type" of the pointee (the <base_type_name> in the first member descriptor is just a dummy, like void *). Whether or not this information is of any use depends on the programming interface. In any event, the only possible use is in writing new instances of the structure, since on read, the "true type" of the pointee is always known.

For each type-cast member of a data structure, there is one entry of the form:

   "%s\001"    base_type    the <base_type_name> of the data structure
                            containing the cast_member and type_member
   "%s\001"    cast_member  the <member_name> of a member with a pointer
                            data type
   "%s\001\n"  type_member  the <member_name> of a member of type char *,
                            whose value (after dereference) contains a
                            string representing the "true type" of the
                            pointee from the cast_member

The end of the Casts block is marked by:

   "\002\n"

If there are no casts, the entire Casts block is "Casts:\n\002\n".

The end of the entire extras section is marked by a second consecutive newline:

   "\n"        end_extras   end of extras section is thus always "\n\n"
                            (unless it is empty)

3F. Pointee format
-----

The PDB pointer/pointee format is not optimal, but it does model several of the most important practical uses of pointers in the C programming language.

Any variable with a full_type containing one or more trailing asterisk "*" characters is a pointer variable, and any member descriptor with asterisks preceding the <member_name> is a pointer member. The pointer itself has no representation at all in the PDB file. A pointer member has a size and alignment within its data structure, but the bytes stored there are garbage. A pointer variable takes up no space at all on disk; its address is the address of the first pointee (the only pointee if the variable is a scalar).
Every pointee consists of a descriptive header of indeterminate length, possibly followed by the pointee data itself. A header-only pointee contains the disk address of the pointee which is followed by the data. This allows multiple pointers to the same data without multiple disk copies of the data. The format of a PDB pointee is: "%ld\001" nitems the number of objects in the pointee (1 if it is a scalar; otherwise the pointee is interpreted as a one dimensional array) "%s\001" full_type <base_type_name><pointer_indicators> specifying the data type of the pointee "%ld\001" address disk address of beginning of this pointee if data_here!=0, otherwise the address of the pointee with data_here!=0 containing the data "%d\001\n" data_here 1 if the first byte of the data immediately follows the "\n", 0 if the address points to another pointee, which is guaranteed to have data_here==1 <any> data only present if data_here!=0 A NULL pointer is marked by a pointee with nitems==0, address==-1, and data_here==0. The full_type of a pointee need not agree with the type expected from the nominal type of its pointer. In effect, every PDB pointer is a C void *, since the pointee contains the data type and number of items. (The original PDBLib programming interface, however, requires the Casts extra in order to be able to write data of a different type than expected on the basis of the pointer declaration; a pointer variable of a type different than one dereference of its full_type cannot be written at all with this interface. The Casts extra is not required for PDBLib to correctly read "cast" data in either pointer variables or pointer members. The features or limitations of a particular programming interface have no bearing on the format of a PDB file, in any event.) The major drawback of the PDB pointer/pointee format is that nothing at a known address actually points to the first pointee; the pointee addresses are stored only in the pointees themselves. 
The address of a pointee is determined as follows: The address of the first pointee of a pointer variable is the address of the variable. (Since the pointers themselves have no disk representation, they are not written). The address of the pointee corresponding to the first pointer member of the first element of an array of structure instances is the address of the byte immediately following the array of instances. (The structure instance array actually takes up space on disk, even if it consists entirely of pointers. The size and alignment of pointer members are specified in the small header for the whole file and in the Alignment extra, respectively. The value of the pointer member itself is meaningless.) When the first pointee has been completely written, including any pointees corresponding to pointer members of its own data type, the pointee corresponding to the second pointer member of the first element of the structure instance array is written starting at the address following the first pointee and all its descendants. This continues until the last pointer member of the first array element, after which comes the first pointer member of the second array element, and so on. The recursive algorithm used to write or read a PDB pointee can be schematically indicated by a recursive function pdb_object, which performs serial I/O on an array of nitems objects of type full_type, beginning at a specified disk address, and returning the address of the byte following what has just been read or written. Any actual interface would be substantially more complicated than this, since additional input arguments would be required to specify data to be written, and additional output arguments would be required to return the data read. 
Nevertheless, here is the schema:

  long pdb_object(full_type, nitems, address)
  {
    if ( is_a_pointer(full_type) ) {
      while (nitems--) {
        address= read_or_write_pointee_header(address);
        if ( not_seen_before(address) )
          address= pdb_object( dereference_type(full_type), 1, address );
      }
    } else {
      address= read_or_write_object_array( full_type, nitems, address );
      while (nitems--) {
        while (pointer= next_pointer_member(full_type)) {
          address= pdb_object( pointer_type(pointer), pointer_nitems(pointer), address );
        }
      }
    }
    return address;
  }

Note that this algorithm does not permit partial read or write operations on array pointer variables or structure instances containing pointer members. The addresses of the pointees are only revealed by performing the entire sequence of read or write operations on a complete variable.

3G. PDB parameterization of floating point layouts -----

Converting from one floating point format to another is far easier than converting from any floating point format into a textual representation of a number. In general, a floating point conversion to any format is not much harder than a conversion to the particular big-endian IEEE format preferred by the netCDF format.

The PDB parameterization of floating point layouts encompasses all machines where the floating point size is a multiple of an 8-bit byte, the exponent is binary, and the exponent and mantissa are contiguous sequences of bits for some permutation of the bytes. Cray 128-bit floating point formats (which have two interrelated exponents) and the hex exponent formats used by a few old mainframes are the only significant floating point formats not covered by the PDB parameterization; such machines must convert their internal formats to a form that is covered in order to read or write a PDB file, just as they must do a conversion to read or write a netCDF file.
The PDB parameterization has three parts: the permutation, the specification of the bit addresses and bit sizes of the sign, exponent, and mantissa, and the bias of the exponent. Using the same notation as in the XDR RFC1014 protocol specification (sections 3.6 and 3.7), the value of a floating point number with sign S, exponent E, and mantissa (or fractional part) F is

  (-1)^S * 2^(E-bias) * 1.F

where ^ represents exponentiation, * represents multiplication, and 1.F means 1 + (F / 2^(number of bits in mantissa)). Zero is always represented by all bits of S, E, and F zero.

There must be some permutation of the bytes of the floating point number such that the bytes containing E (and F) are contiguous and ordered from most significant bits of E (and F) to least significant. In this big-endian style order, the bits can be numbered from zero to one less than eight times the number of bytes. The S, E, and F fields can then be described by specifying the bit on which they start (bit addresses), and the number of bits over which they extend (bit size). The sign always has a bit size of 1.

In a PDB file, the permutation is a list of all the numbers from 1 to sizeof(float) or sizeof(double) where the value 1, 2, 3, and so on represents the location of a byte in the standard big-endian byte order defined above, and the position of the value in the list represents the position of that byte in the actual floating point number. Hence, the permutation for a float on a Sun SPARCstation or in an XDR file is {1, 2, 3, 4} (standard big-endian order), while the permutation of a float on a DECstation 3100 is {4, 3, 2, 1} (standard little-endian order).

The VAX is the only machine with floating point formats having a non-monotonic permutation. A VAX is a little-endian machine, but its floating point format is big-endian with respect to 2-byte words, with each 2-byte word little-endian. The resulting PDB permutation for a VAX float is {2, 1, 4, 3}.
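As a rough illustration of the permutation and the S, E, F field description above, the following C sketch rearranges the bytes of a stored number into the standard big-endian order, extracts the three fields by bit address and bit size, and evaluates (-1)^S * 2^(E-bias) * 1.F. The function names are hypothetical and are not part of PDBLib. The sketch assumes an implicit leading 1, a number of at most 8 bytes, and ignores special values such as infinities.

```c
#include <stddef.h>

/* Hypothetical helper: copy the bytes of a stored number into the standard
 * big-endian order.  perm[i] is the 1-origin location, in standard order,
 * of the byte found at position i of the number as stored. */
void pdb_to_standard(const unsigned char *raw, const int *perm,
                     size_t size, unsigned char *std_bytes)
{
    size_t i;
    for (i = 0; i < size; i++)
        std_bytes[perm[i] - 1] = raw[i];
}

/* Hypothetical helper: extract nbits bits starting at bit address addr,
 * where bit 0 is the most significant bit of the first byte. */
unsigned long long pdb_bits(const unsigned char *std_bytes, int addr, int nbits)
{
    unsigned long long value = 0;
    int b;
    for (b = addr; b < addr + nbits; b++)
        value = (value << 1) | ((std_bytes[b / 8] >> (7 - b % 8)) & 1u);
    return value;
}

/* Multiply x by 2^k using repeated doubling or halving (no libm needed). */
double pdb_scale2(double x, long k)
{
    for (; k > 0; k--) x *= 2.0;
    for (; k < 0; k++) x /= 2.0;
    return x;
}

/* Hypothetical decoder for a layout with an implicit leading 1 (1.F).
 * Assumption of this sketch: size <= 8 bytes; larger sizes return 0.0. */
double pdb_decode(const unsigned char *raw, const int *perm, size_t size,
                  int sign_addr, int exp_addr, int exp_size,
                  int man_addr, int man_size, long bias)
{
    unsigned char std_bytes[8] = {0};
    unsigned long long s, e, f;
    double value;

    if (size > sizeof std_bytes) return 0.0;
    pdb_to_standard(raw, perm, size, std_bytes);
    s = pdb_bits(std_bytes, sign_addr, 1);
    e = pdb_bits(std_bytes, exp_addr, exp_size);
    f = pdb_bits(std_bytes, man_addr, man_size);
    if (!s && !e && !f) return 0.0;              /* all bits zero is 0.0 */

    value = 1.0 + pdb_scale2((double)f, -man_size);   /* 1.F */
    value = pdb_scale2(value, (long)e - bias);        /* times 2^(E-bias) */
    return s ? -value : value;
}
```

For a 4-byte IEEE number (sign at bit 0, 8 exponent bits starting at bit 1, 23 mantissa bits starting at bit 9, bias 127), a little-endian machine would pass the permutation {4, 3, 2, 1} and a big-endian machine {1, 2, 3, 4}; the remaining arguments are identical in both cases, which is the point of separating the permutation from the field layout.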
The PDB convention for specifying floating point bit sizes and addresses is:

  byte  bits_per_word     8 * number of bytes (redundant)
  byte  exponent_size     number of bits in exponent
  byte  mantissa_size     number of bits in mantissa
  byte  sign_address      bit address of sign (in standard byte order as described above)
  byte  exponent_address  bit address of exponent
  byte  mantissa_address  bit address of mantissa
  byte  mantissa_flag     0 if high order bit of mantissa is preceded by implicit 1 (1.F)
                          1 if high order bit of mantissa is explicitly the 1 (always set except in representation of 0.0)
  long  exponent_bias     bias of the exponent (must be less than 2^31 to fit into a long on all machines, but this is not a practical limitation)

The mantissa_flag allows for one more difference between floating point formats: the 1 in 1.F is sometimes explicitly included as the first bit of F. This is the case for the Cray floating point format, and for 10 and 12 byte floating point formats on several platforms.

A few explicit examples of this parameterization should remove all doubts about its meaning:

          bits  #E  #F  &S  &E  &M  1?  bias
  float   {32,   8, 23,  0,  1,  9,  0,   127}   (netCDF/XDR standard)
  double  {64,  11, 52,  0,  1, 12,  0,  1023}   (netCDF/XDR standard)
    (Above two cover the vast majority of modern machines, which are distinguished only by the permutation.)
  float   {64,  15, 48,  0,  1, 16,  1, 16384}   (Cray 1, XMP, YMP)
  float   {32,   8, 23,  0,  1,  9,  0,   129}   (VAX)
  double  {64,   8, 55,  0,  1,  9,  0,   129}   (VAX D-format)
  double  {64,  11, 52,  0,  1, 12,  0,  1025}   (VAX G-format)
    (permutations of VAX doubles are {2, 1, 4, 3, 6, 5, 8, 7})
  double  {96,  15, 64,  0,  1, 32,  1, 16382}   (Macintosh long double)
    (Bits 16-31 unused in this format.)

Note that the permutation is not uniquely specified for the Cray and Macintosh long double formats. In such a case, the closest permutation to one of the monotone permutations should be selected.
Another way to say this is to require that in the standard big-endian order, the E (exponent) field should always have a smaller bit address than the F (mantissa) field, and the location of any unused bytes relative to E and F should be preserved.

3H. Restrictions on characters used in names -----

The use of "\001", "\002", and "\n" as separators in PDB string formats precludes their use in variable names, type names, or member names. Using the null byte 0x00 in any name would make it far more difficult to write a C program to access PDB files, so this character is illegal in any names as well. Hence 0x00, 0x01, 0x02, 0x0a, 0x0d, and 0x1f may never appear in any PDB string converted with %s in the above description.

Another absolute proscription follows from the method used to indicate the type of a pointer variable, namely, a data type name (either from the structure chart or from the Primitive-Types extra block) may not contain any space, tab, or asterisk characters, " ", "\t", or "*", that is 0x20, 0x09, or 0x2a. For the same reason, structure member names may not include spaces, tabs, or asterisks. Additionally, no structure member name may contain either character marking the beginning of the dimension list, "(" or "[", that is, 0x28 or 0x5b.

A more generic limitation is that no "\n"-terminated sequence of characters in the extras section may begin with "%s:" where the string could be mistaken for any present or future extra_id name. For example, a data type name which appears in the Casts extra block had better not have a name like "Alignment:". This ugly possibility can be eliminated by more careful design of future extra block formats; if such a measure is not taken, then everyone sharing PDB files will need to be sure they are using the same version of the programming interface. For now, this is not really a practical problem.
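The restrictions intrinsic to the file format can be checked mechanically. The following C sketch is one way to do so; the enum and the function pdb_name_ok are hypothetical names invented for this illustration, and the sketch covers only the format-level rules stated above, not any further limits imposed by a particular programming interface.

```c
#include <string.h>

/* Hypothetical classification of the three kinds of PDB names. */
enum pdb_name_kind { PDB_VARIABLE, PDB_TYPE, PDB_MEMBER };

/* Return nonzero if name satisfies the format-level restrictions.
 * The null byte 0x00 cannot occur inside a C string, so it needs no test. */
int pdb_name_ok(const char *name, enum pdb_name_kind kind)
{
    /* 0x01, 0x02, 0x0a, 0x0d, 0x1f: forbidden in every PDB string */
    const char *never = "\001\002\n\r\037";
    const char *extra = "";

    if (kind == PDB_TYPE)
        extra = " \t*";          /* space, tab, asterisk */
    else if (kind == PDB_MEMBER)
        extra = " \t*([";        /* also the dimension list openers */

    return strpbrk(name, never) == NULL && strpbrk(name, extra) == NULL;
}
```

A writer could call pdb_name_ok on every name before adding it to the symbol table or structure chart, rejecting the name rather than producing a file whose text sections cannot be parsed.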
One final warning is that the original PDBLib programming interface imposes naming restrictions beyond those intrinsic to the PDB file format, as just described. These are as follows: the proscription of the characters "(" and "[" used to introduce dimension lists is extended to variable names in addition to structure member names. Furthermore, the period, "." or 0x2e is illegal both in variable names and in structure member names. (These restrictions arise because the PDBLib interface cracks ASCII text strings to perform partial read and write operations. If non-text-based partial read and write functions were available, the additional restrictions on characters in file names would disappear.)

The restrictions on characters used in PDB names are summarized in the following table:

                           variable   data type   structure member
                           names      names       names
  0x00, 0x01, 0x02,        NEVER      NEVER       NEVER
  0x0a LF, 0x0d CR,
  0x1f US
  0x09 TAB, 0x20 SPACE,    -          NEVER       NEVER
  0x2a "*"
  0x28 "(", 0x5b "["       BAD        -           NEVER
  0x2e "."                 BAD        -           BAD

Here "BAD" means that the highest level functions in the original PDBLib programming interface will fail. Needless to say, the best idea is to avoid any character in the above list under all circumstances.

------------------------------------------------------------------------------
4. A Generic Binary Data Description Language
---------------------------------------------

Unidata has provided a plain-text representation for netCDF files (CDL format) which is extremely useful. The following plain-text format can describe either a netCDF or a PDB file (as well as HDF and the majority of one-of-a-kind binary file formats which have been designed to store scientific data).

The PDB format provides two capabilities lacking in the netCDF format: non-XDR primitive data formats, and definable data types including compound data structures and additional primitive types. The netCDF format offers two capabilities lacking in the PDB format: history records, and variable attributes.
In a PDB file, the want of a formal provision for history records or for variable attributes can be fulfilled by variable naming conventions. The trick is to use special naming conventions to associate related variables ("x-units" might be the variable used to store the "units" attribute of the variable "x"), or to imbue a special significance to some of the variables in the file (instances of a history record sequence might be named "rec0000", "rec0001", and so on). Such tricks result in data which is accessed nearly as efficiently as with the netCDF interface to a netCDF file. Similar tricks allow single instances of data structures, such as FORTRAN common blocks, to be accessed in a netCDF file. However, an array of structure instances is very difficult to handle in netCDF format, and any file with primitive data formats other than XDR is impossible. Of course, the underlying simplicity of the netCDF format is really a virtue, not a weakness. A physics code written in FORTRAN can't really generate any data the netCDF format can't handle. In fact, in designing a PDB format suitable for holding restart or post-processing information for a physics code, one of the primary objectives is to remain within the bounds of the data describable by the netCDF format. Two considerations beyond the scope of the format of an individual file are important. The first is robustness against unexpected program or machine crashes when part, but not all, of the data in a file has been written. The second is the difficulty of handling extremely large files as a single unit -- it is much easier to deal with a few dozen smaller files than one monster. Both problems commonly arise with files used for storing history data, which is generally stored one record at a time with a long pause between the writing of one record and the next, and which may grow to a very large size. The usual PDB format, which places the self-descriptive information at the end of the file, is not very robust. 
Unless the structure chart, symbol table, and extras are written after each history record, only to be overwritten by the next, a crash causes the data in the file to be uninterpretable. The file format itself allows the structure chart, symbol table, and extras section to precede the data, as in a netCDF file; perhaps such a strategy should be adopted to deal with this problem.

A very general way to deal with the problem of robustness is to keep two files open. Whenever new data is written at the end of the data file, its description is written at the end of the description file. The files can be merged when they are really finished, or left as separate components of a single whole. This two-file scheme has the advantage that new data can be declared after some data has been written, without sacrificing robustness in the face of program or machine crashes. The plain-text binary data description language defined below is designed to be usable as the format for the description member of such a pair.

The problem of dealing with very large files is easy to handle in the case of history data -- just produce a family of files each having a restricted length, instead of a single giant file. Splitting history data across several files has a major impact on the programming interface used to access the data, but no effect on the format of an individual file. The most important thing to notice here is that an attractive feature of the netCDF programming interface -- that you can retrieve values of a particular variable over a range of times with a single subroutine call -- becomes much more difficult to implement if the data extends over several files. Another remark is that it is wise to copy the non-record data into each file in a history family, so that each file "makes sense" by itself.
Bearing all of these remarks in mind, here is a generic binary data description language, hereby christened "Clog", which can describe the contents of any PDB or netCDF file, as well as many other binary file formats:

4A. Notation -----

The extended Backus-Naur Form notation used in the RFC1014 XDR standard is adopted here:

  1. The characters |, (, ), [, ], and * are special.
  2. Terminal symbols are strings surrounded by double quotes.
  3. Non-terminal symbols are strings of non-special characters.
  4. Alternative items are separated by the vertical bar character |.
  5. Optional items are enclosed in square brackets [ ... ].
  6. Items are grouped by enclosing them in parentheses ( ... ... ).
  7. An item followed by * means zero or more occurrences of that item.
  8. A non-terminal followed by : and a set of alternatives constitutes the definition of that non-terminal.

Comments in a binary data description file begin with /* and end with */, and are treated as whitespace.

An identifier is a letter or underscore "A-Za-z_", followed by zero or more letters, digits, underscores, pluses, minuses, periods, or commas "A-Za-z_0-9,.+-". An identifier may also consist of a quoted string, which is interpreted as the characters within the quotes, recognizing the following escape sequences:

  \"    double quote
  \\    one backslash
  \ooo  an arbitrary 8-bit byte, except that \000 and any following characters are ignored

An identifier may not be more than 1023 characters long, in its printed form including the open and close quotes, if any.

A number is a sequence of one or more decimal digits optionally preceded by a minus "-". A float_number is anything readable by the standard C library "%e" format directive.

All control characters and spaces are treated as whitespace, principally to allow for any differences in the newline character among various operating systems.

4B.
Overview of language -----

The binary data description language Clog is modeled on C variable declaration and structure definition syntax. C declarations relate a variable name, data type and dimension information. Clog variable declarations must additionally specify the disk address for the variable. Clog structure definitions are similarly extended to allow the offset of each member to be specified. This extension makes it easier to automatically generate a Clog description of some binary file formats.

Following the PDB file format, Clog allows the bit-by-bit format of the primitive data types to be specified. This greatly increases the set of binary files describable using Clog. This same mechanism enables new primitive data types to be declared.

Following the netCDF file format, Clog has a formal mechanism for describing a sequence of history records. This allows natural descriptions of an important class of binary files (including netCDF files).

Finally, Clog contains a formal means for including new types of descriptive information not envisioned at its inception. (This has been a valuable feature of the PDB file format.) Separate "trial" and "standard" extension syntax is provided. The rules for Clog extensions are simple: No Clog extension can alter the meaning of any previously defined part of Clog (thus, nothing like the "Major-Order" extra block in the PDB file format is acceptable). And any Clog extensions should be conceived as supplying supplemental information, rather than as wholesale replacements of existing features to get additional functionality.

4C.
Basic primitive data types -----

There are six basic primitive data types:

  char    an 8-bit byte
  short   a signed integer of at least 2 bytes - used when small size is important
  int     a signed integer of at least 2 bytes - most efficient integer type, used for boolean values
  long    a signed integer of at least 4 bytes - most commonly used integer type, e.g.- an array index
  float   a floating point number of at least 4 bytes, range is at least 10^(+-38), precision at least 6 decimal digits
  double  a floating point number usually 8 bytes, range is at least 10^(+-38), precision at least 9 decimal digits, but usually at least 10^(+-307) and 14 decimal digits

No data structure (compound data type) or additional primitive may have one of these six names. Any one of these six names may be used as a data type without any definition. All other identifiers used as data type names must be previously defined, with the following two exceptions:

  string   a string of 8-bit ASCII characters not containing '\0' - a string is represented as a long containing the disk address of the string; the string itself is a long (aligned as a long) with a non-negative count of the non-0 characters in the string (i.e.- the result of the ANSI C strlen function), followed by that many characters.
  pointer  a pointer to an array of any type - the pointee contains the data type and dimension information; the pointer is represented as a long containing the disk address of the pointee

If string or pointer is used as a data type without defining it, the default meaning is as shown. Once used, the string or pointer data type may not be redefined. A NULL pointer or string is represented by a disk address of -1; there is no associated pointee.

4D.
Clog file layout -----

  clog_description:    "Contents Log" basic_statement*
                       [ record_initializer basic_statement* record_declaration* ]
                       [ end_of_data ]
  basic_statement:     primitive_definition | structure_definition | variable_declaration | alignment_spec | other_information
  record_initializer:  record_declaration | record_begin
  end_of_data:         "+" "eod" "@" disk_address

The clog_description must begin with the QUOTED string "Contents Log" -- if it is not quoted, the Clog lexical rules will divide it into two tokens. Note that comments and white space may precede the "Contents Log" token.

The clog_statement's order is restricted by a general "definition before use" rule, as described in detail below. Basically, this means a data type must be defined before it can be used.

If the record_initializer is present, clog_statements before it describe non-record variables, while clog_statements afterwards describe record variables. Notice that all record_declaration statements, except possibly the first, follow the clog_statements which declare the record variables. Thus, the structure of the records in a Clog description cannot change.

If present, the +eod statement must be the very last thing in a Clog description of a file, even beyond any comments. The disk_address specified is the address of the first byte beyond all of the data in the file. This may be beyond the end of any variable declared in the Clog for any number of reasons; the most obvious is that there may be pointees beyond the last declared data. The entire +eod statement from the initial "+" to the final digit of the disk_address must not occupy more than 80 characters.

In addition to specifying a safe address at which to begin adding data to the binary file, the +eod statement allows the entire Clog to be appended to the end of the binary file itself to make a single, self-descriptive package. (Note that this procedure does not damage either a netCDF or a PDB file.)
This is done as follows: At the address specified in the +eod statement, write the entire plain-text Clog description of the file, including the final +eod statement. Then close and truncate the file. A program which opens the file can scan the last 80 bytes; if it finds a +eod statement, it can check that "Contents Log" is the first token after the specified address, and, if so, interpret the Clog to determine the layout of the binary data in the file.

4E. Variable declaration -----

  variable_declaration:  type_name variable_name dimension_spec* ["@" disk_address]
                         ("," variable_name dimension_spec* ["@" disk_address])*
  type_name:             identifier
  variable_name:         identifier
  dimension_spec:        "[" dimension_length [dimension_name] "]"
                       | "[" minimum_index ":" maximum_index [dimension_name] "]"
  dimension_name:        identifier
  dimension_length:      number
  minimum_index:         number
  maximum_index:         number
  disk_address:          number

The disk_address is a byte address. If omitted, the default is the next available address after all the variables previously declared. The "next available" address may be rounded up for alignment purposes, as discussed in more detail below.

If more than one dimension_spec is present, the slowest varying dimension is first in the list, and the fastest varying dimension last. This is the C convention. As in C, a multidimensional array is best regarded as an array of arrays -- the first index specifies which array, so the second index must vary faster.

The optional dimension_name is provided for easier compatibility with the netCDF format. Behavior is undefined if the same dimension_name is used for dimension_spec's with different lengths. The idea is to distinguish between dimension lengths which are accidentally equal, and those which are equal by virtue of their variable's meanings. If not supplied, the default dimension name is "_%ld", where "%ld" is the decimal representation of the dimension length.
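To make the dimension ordering concrete, the following C sketch computes the byte offset of one array element from the start of a variable, given the dimension lengths in declaration order (slowest varying first), and also builds the default dimension name. Both function names are hypothetical, and the indices are assumed to be zero-origin.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: byte offset of element index[0..ndims-1] from the
 * start of an array whose dimension lengths are listed slowest varying
 * first, as in a Clog variable_declaration.  Indices are zero-origin. */
long clog_element_offset(const long *lengths, const long *index,
                         int ndims, long elem_size)
{
    long linear = 0;
    int d;
    for (d = 0; d < ndims; d++)
        linear = linear * lengths[d] + index[d];   /* last index varies fastest */
    return linear * elem_size;
}

/* Hypothetical helper: write the default dimension name "_%ld" into buf. */
void clog_default_dim_name(long length, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "_%ld", length);
}
```

For a declaration such as double x[3][4], element (1, 2) lies 6 elements, that is 48 bytes, beyond the disk_address of x, and its two unnamed dimensions would receive the default names "_3" and "_4".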
The alternative minimum_index:maximum_index syntax is intended to suggest the preferred range of the index values. The equivalent dimension_length is maximum_index-minimum_index+1. This information is far less important than the dimension_length, since the dimension_length values (in their proper order!) specify the topology of the array -- that is, how to find nearest neighbors along the various dimensions.

The type_name and variable_name identifiers must be unique among all other type_name and variable_name identifiers, respectively. However, there are two separate name spaces, so a type_name identifier may match a variable_name identifier without conflict. The dimension_name identifiers, if any, form a third independent name space.

4F. Primitive data type definition -----

  primitive_definition:
      "+" "define" type_name "[" size_value "]" "[" alignment_value "]"
          [ "[" order_value "]"
            ["{" sign_address exponent_address exponent_size mantissa_address mantissa_size mantissa_flag exponent_bias "}"] ]
    | "+" "define" "string" "standard"
    | "+" "define" "pointer" "standard"
  size_value:        number
  alignment_value:   number
  order_value:       number | "sequential" | "pdbpointer"
  sign_address:      number
  exponent_address:  number
  exponent_size:     number
  mantissa_address:  number
  mantissa_size:     number
  mantissa_flag:     number
  exponent_bias:     number

The type_name must neither have been defined nor referenced previously. All primitive type definitions must have a size_value (the number of bytes one instance occupies) and an alignment_value (the largest number by which the byte offset of a structure member of this type is guaranteed to be divisible).

The order_value, if a number, determines the byte ordering within the size_value. If order_value is not present, the primitive type is to be regarded as opaque. If order_value is present, but { ... } is not present, the primitive type is an integer value. If { ...
} is present, the primitive type is a floating point value (order_value must be present also in this case).

If the order_value is the identifier "sequential", then any instance of this primitive requires sequential I/O; that is, portions of arrays of this type may not be read or written. If the size_value is 0, then the size of an instance is indeterminate. Otherwise, an instance of the object occupies the specified size and has the specified alignment as a structure member. The code reading or writing the file is responsible for recognizing the name of a sequential primitive and taking appropriate action to read or write it.

The parameterization of a floating point format is the same as described above for the PDB file format. The order_value is a simplification of the general byte permutation provided by the PDB file format. The meaning of the order value is as follows:

  order_value ==>

  1. The magnitude of the order_value represents the number of bytes per "word". Within a word, the byte order is monotone (either from most significant to least significant or vice versa). The size_value is thus a multiple of the magnitude of the order_value. If the entire object has monotone byte order, then the magnitude of the order_value is one. In practice, this is always the case, except for the VAX floating point formats.

  2. The sign of the order value determines the word order. The byte order within a word is always opposite to the word order. (Otherwise the entire word is monotone, the word size is one, and the word order is the byte order.) The sign is positive if the most significant word is first, negative if the least significant word is first.

  3. An order_value of zero is equivalent to omitting the order_value altogether; it indicates opaque data with the specified size and alignment.
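The three rules above can be turned into an explicit PDB-style byte permutation. The following C sketch shows one reading of those rules; the function name clog_order_to_perm is hypothetical, and the output uses the PDB convention that entry i holds the 1-origin location, in standard big-endian order, of the byte stored at position i.

```c
#include <stdlib.h>

/* Hypothetical helper: expand a numeric order_value into a byte permutation
 * of length size.  Returns 0 on success, or -1 if the data is opaque
 * (order_value 0) or size is not a multiple of the word size. */
int clog_order_to_perm(int order_value, int size, int *perm)
{
    int w = abs(order_value);      /* bytes per word */
    int nwords, i;

    if (w == 0 || size <= 0 || size % w != 0)
        return -1;
    nwords = size / w;

    for (i = 0; i < size; i++) {
        int word = i / w;          /* which word this stored byte is in */
        int byte = i % w;          /* its position within that word */
        int place;                 /* 0-origin location in standard order */

        if (order_value > 0)       /* most significant word first,
                                      least significant byte first in word */
            place = word * w + (w - 1 - byte);
        else                       /* least significant word first,
                                      most significant byte first in word */
            place = (nwords - 1 - word) * w + byte;
        perm[i] = place + 1;
    }
    return 0;
}
```

With a word size of one the two branches reduce to plain big-endian and little-endian order. An order_value of 2 with a size_value of 8 yields {2, 1, 4, 3, 6, 5, 8, 7}, which is the permutation listed for VAX doubles in the PDB section.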
In brief, the vast majority of numeric formats fall into one of the following four categories:

  order_value=  1  for big-endian (MSB first) machines
  order_value= -1  for little-endian (LSB first) machines
  order_value=  0  for opaque data
  order_value=  2  for VAX floating point formats

The precise order of definition of primitive types is significant if the file contains pointers to objects of non-predefined types. In this case, the data type of the pointee will be encoded as an ordinal based on the order of +struct and +define definitions. The exception is a +define with a type_name of one of the predefined types: char, short, int, long, float, double, string, or pointer. The order of such a +define is unimportant, provided only that it precede the first use of that predefined type. As explained in the next section, a +define of type long must also precede any uses of the predefined string or pointer types.

To use the default definitions of string and pointer, you must NOT specify them using a +define (doing so would redefine them to the meaning specified in the +define), OR you must use the special "standard" forms of +define. With their default meanings, a string and a pointer are each represented as a long.

In general, +defines of the six basic primitive data types should precede any other statements in the Clog description of a file. If these are not defined, the default is the standard for the machine on which the binary data file was written. Without the basic primitive type definitions, therefore, a Clog description of a file is not portable across different machine architectures.

As an example, a Clog describing a netCDF file (in XDR format) would begin:

  +define char   [1][4][1]
  +define short  [2][4][1]
  +define int    [4][4][1]
  +define long   [4][4][1]
  +define float  [4][4][1] {0 1 8 9 23 0 127}
  +define double [8][4][1] {0 1 11 12 52 0 1023}

Note that, since a netCDF file does not support data structures, the alignment_value is not really significant.
For the same reason, the definition of int is unnecessary. Furthermore, a definition of a synonym for char would be appropriate:

  +define byte [1][4][1]

4G. Compound data structure definition -----

  structure_definition:    "+" "struct" type_name "{" full_member_definition member_definition* "}"
  full_member_definition:  type_name member_name dimension_spec* ["@" byte_offset]
  member_definition:       full_member_definition
                         | "," member_name dimension_spec* ["@" byte_offset]
  member_name:             identifier

The leading "+" permits immediate recognition of a structure_definition as opposed to a variable_declaration, without the necessity for making "struct" a reserved word in the context of a type_name.

If present, the byte_offset specifies the byte offset of the member from the beginning of an instance of the data structure. If absent, the byte offset is the next available byte beyond all previously declared members; the first member has a byte_offset of 0 by default. The "next available" byte offset is always rounded up to the nearest multiple of the alignment value for the type_name of the member. The alignment value of a compound structure is the largest alignment value for any of its members (see the discussion of alignment within data structures below). Despite the fact that the byte_offset syntax allows it, no two members of a data structure may overlap.

The body of each structure definition has its own name space, so the member_name need only be unique among all the member_names for the structure currently being defined.

As usual, type_name in a member_definition must be either a predefined primitive, or must have been previously defined. Obviously, a structure may not contain members which are instances of itself. Beyond this "define before using" requirement, the precise order of definition of structures is significant if the file contains pointers to objects which are structures.
In this case, the data type of the pointee will be encoded as an ordinal based on the order of +struct and +define definitions.

4H. Record definition
-----

   record_begin:
        "+" "record" "begin"
   record_declaration:
        "+" "record" "{" [time_value] "," [cycle_value] "}" ["@" disk_address]
   time_value:
        float_number
   cycle_value:
        number

The first occurrence of +record changes the meaning of subsequent variable_declaration statements from declaring non-record variables to declaring record variables. This first occurrence may actually declare the first history record itself, or it may merely be the record_begin marker.

In any record_declaration, the disk_address defaults to the next available address, as for a variable declaration. The time_value and the cycle_value need not actually represent the time and cycle number associated with the record; they can be any double and long value which characterizes a record. Note that the time_value is specified only to the accuracy with which it is printed; if a precise time is required, the time should be made a record variable. The intent of time_value and cycle_value is to provide more informative "names" for the record than merely its position in the sequence of records. If either time_value or cycle_value is omitted in the first history record instance declaration, it must be omitted in all following declarations; similarly, if present in the first declaration, it must be present in all subsequent declarations.

Because of the possibility of families of time history files, it is very difficult to realize a programming interface which allows non-record data to be written after the writing of records has begun. This practical difficulty is the reason for the division of the Clog description into a non-record section, followed by a record section.

The restriction of history data to a sequence of a single type of record, rather than allowing several interleaved sequences of records of various types, is also deliberate.
If several types of record need to be written, several output files or file families should be used.

Note that a member of the history record structure may be a pointer to a block of data whose size changes from record to record. For this reason, Clog records are not necessarily laid out end-to-end in the file, as are netCDF records. If the record addresses are random, the efficiency of collection of a portion of the record data across several or all records is reduced.

More importantly, the Clog description of a file will fail in practice if the storage of an exceedingly large number of exceedingly small records is attempted. As a rule of thumb, you're in trouble if your record length is less than the number of bytes in the "+record" statement necessary to declare the record (this can't be less than 8 bytes). If you are in this category, you should strongly consider buffering several records and writing them as an array for the sake of efficiency anyway.

4I. Additional alignment information
-----

   alignment_spec:
        "+" "align" alignment_type "[" alignment_value "]"
   alignment_type:
        "variables" | "structs"

Some files, for example netCDF files, impose additional alignment restrictions on variables. This can be specified using the "+align variables" syntax in Clog. As a special case, an alignment_value of 0 means that the same alignment should be applied to a variable in a file as if it were a member of a data structure. An alignment_value of 1 means that there is no padding between variables; every variable starts on the byte immediately following the previous variable. The "@address" syntax in a variable declaration overrides the "+align variables" statement. The special value 0 is the Clog default; netCDF files would have "+align variables 4", and PDB files would have "+align variables 1".
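
The effect of the "+align variables" value on the default address of a variable can be sketched in a few lines of Python. This is an illustration only, not part of any Clog implementation: the function name next_variable_address and its parameters are hypothetical, and the sketch assumes that an alignment_value greater than 1 simply means rounding the next free address up to a multiple of that value.

```python
def next_variable_address(next_free, type_alignment, align_variables=0):
    """Default disk address of the next variable (hypothetical helper).

    next_free       -- first byte after the previously declared variable
    type_alignment  -- alignment the variable's type would have as a
                       struct member (third bracket of its +define)
    align_variables -- value from "+align variables"; 0 is the Clog default
    """
    if align_variables == 0:
        # Special case: align as if the variable were a struct member.
        alignment = type_alignment
    else:
        # Assumption: any other value is used directly as the alignment;
        # 1 therefore means no padding at all.
        alignment = align_variables
    # Round next_free up to the nearest multiple of alignment.
    return -(-next_free // alignment) * alignment


if __name__ == "__main__":
    print(next_variable_address(13, 8))                      # Clog default
    print(next_variable_address(13, 8, align_variables=4))   # netCDF-like
    print(next_variable_address(13, 8, align_variables=1))   # PDB-like
```

An explicit "@address" in the variable declaration would bypass this calculation entirely.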
Some C compilers place an additional alignment restriction on struct members which are themselves struct instances (beyond the usual restriction that the alignment of a struct instance is the same as the alignment of its most restrictively aligned member). Such an additional alignment restriction may be expressed in Clog via the "+align structs" syntax. The value 1 is the default, meaning that there is no additional alignment restriction on struct instances.

At most one "+align" statement of each type is allowed in a Clog description. If present, these must come before any variables, compound data structures, or records have been declared.

4J. Predefined string and pointer formats
-----

As indicated above, the type_names "string" and "pointer" have an optional predefined meaning, which is designed to map relatively easily into the C language "char *" and "void *" data types. In order to do this, descriptive information unavoidably leaks from the description of the binary data into the data itself. The following design minimizes this leakage; nevertheless, a binary file designer should avoid gratuitous use of indirect data types.

An instance of either string or pointer is represented on disk as a long. Hence, no use of the default string or pointer types may precede a +define of the long primitive. The value of that long is interpreted as a disk address (as always, in bytes, with 0 meaning the address of the first byte). A disk address of -1 is taken as a NULL pointer, meaning that there is no data associated with the pointer.

In the case of a string, a character count of type long will be found at the disk address specified in the (non-negative) pointer. The characters of the string, if any, begin with the byte immediately following the character count. The terminating NULL character is not included in either the character count or the string itself.
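
As an illustration of the string layout just described, the following Python sketch follows a string's disk address to its character count and characters. It is not taken from any existing reader: the function name read_clog_string is hypothetical, and the sketch assumes that long has been defined as a 4-byte big-endian integer (as in the XDR example of +define long) and that the whole file is already in memory as a bytes object.

```python
import struct


def read_clog_string(data, pointer_value, long_format=">i"):
    """Return the bytes of a predefined Clog string, or None for NULL.

    data          -- entire file contents as bytes (assumption of this sketch)
    pointer_value -- the long stored in the string instance (a disk address)
    long_format   -- struct format of the file's long; ">i" assumes a
                     4-byte big-endian long
    """
    if pointer_value == -1:
        # A disk address of -1 is the NULL pointer: no data at all.
        return None
    # The character count, of type long, sits at the disk address.
    (count,) = struct.unpack_from(long_format, data, pointer_value)
    # Characters begin on the byte immediately after the count; there is
    # no terminating NUL in the file.
    start = pointer_value + struct.calcsize(long_format)
    return data[start:start + count]


if __name__ == "__main__":
    # Build a tiny fake file: 8 bytes of padding, then count=5, then "hello".
    fake_file = bytes(8) + struct.pack(">i", 5) + b"hello"
    print(read_clog_string(fake_file, 8))
    print(read_clog_string(fake_file, -1))
```

A reader written in C would append the terminating NUL itself after copying the counted characters.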
In the case of a pointer, the (non-negative) pointer is the disk address of a small header describing the pointee, followed by the pointee data itself. The header is an array of long integers encoded as follows:

   long type_number
        number representing the data type of the pointee data:
        0 char, 1 short, 2 int, 3 long, 4 float, 5 double, 6 string, 7 pointer,
        >=8 for the data types defined using +struct or +define in the order of definition in the Clog
   long n_dims
        number of dimensions (0 if scalar)
   long[n_dims] length[n_dims]
        (not present if n_dims==0) the number of elements along each dimension, in order from slowest varying to fastest varying dimension
   <garbage> pad
        any pad necessary to align the data to a disk address which would be acceptable if it were an ordinary variable
   <type_number> data
        the pointee itself

4K. Generic extension syntax
-----

The preceding sections cover the basic requirements for being able to decipher the contents of a binary data file. Of course, you can't necessarily do anything with a bunch of numbers just because you can read them. In general, the meaning of the numbers in a binary file emerges only from careful documentation. The required level of documentation is appropriate to a user's manual for the program which wrote the file, not to the Clog description of the file. Nevertheless, sometimes it is appropriate to carry a higher level of meaning around with a binary data file. The Clog therefore provides a generic syntax for such information:

   other_information:
        "+" public_extension [extension_id] "{" extension_data "}" ["@" disk_address]
        "-" private_extension [extension_id] "{" extension_data "}" ["@" disk_address]
   public_extension:
        identifier
   private_extension:
        identifier
   extension_id:
        identifier
   extension_data:
        <any sequence of tokens with balanced "{" and "}">

The notion of a public_extension is that considerable effort be expended to ensure that the associated identifier be unique across all implementations of the Clog.
A private_extension, on the other hand, can be used immediately and freely at a single site. The private_extension syntax should not be used as a substitute for the comment syntax /* ... */.

A public_extension cannot have the identifiers "struct", "define", "history", or "eod". Furthermore, the following public_extensions are hereby defined, in order to prevent the corresponding inevitable private extensions:

   "+" "pedigree" "{" pedigree_spec ("," pedigree_spec)* "}"
   pedigree_spec:
        "created_by" "=" identifier
      | "creation_date" "=" identifier
      | "modified_by" "=" identifier
      | "modification_date" "=" identifier
      | "revision" "=" number
      | "archive_id" "=" identifier
      | "format_version" "=" number
      | identifier "=" identifier
      | identifier "=" number

Blue bloods and bureaucrats demand pedigrees for their data.

   "+" "attributes" [variable_name] "{" attribute_spec (";" attribute_spec)* "}"
   attribute_spec:
        attribute_name ["=" attribute_value]
   attribute_name:
        identifier
   attribute_value:
        number ("," number)*
      | float_number ("," float_number)*
      | identifier

The attribute extension handles netCDF-style attributes. The number and float_number tokens are extended by the suffix notation described in the netCDF User's Guide from Unidata, in the section on CDL format. An identifier as an attribute_value covers the case of a string valued attribute; it should normally be a quoted string. If the variable_name is not present, the attributes apply to the whole binary file. If present, the variable_name must specify a previously declared variable.

   "+" "value" variable_name "{" variable_value "}"
   variable_value:
        number ("," variable_value)*
      | float_number ("," variable_value)*
      | "{" variable_value "}"

This extension is provided in order to be able to directly translate Unidata CDL files into Clog files. Just as the string and pointer data types cause a leakage of descriptive information into the data, the +value extension amounts to a leakage of data into the descriptive information.
Each level of { ... } descends one level into a structure instance.

   "+" "PDBpointer" ("variable" | "member") "{" type_name "}"

The type_name must specify an opaque data type previously defined by +define. Two separate types should be supplied; one for variables, which has size==0, and one for structure members, which has non-zero size. Both are sequential data types, since any object containing a PDB-style pointer must be read sequentially as a complete block. The following declarations would be reasonable:

   +define "char *" [0][4][sequential]
   +PDBpointer variable { "char *" }
   +define "char  *" [4][4][sequential]   /* note extra space */
   +PDBpointer member { "char  *" }

These definitions assume that the sizeof(void *) specified in the PDB prim_info section was 4, and the ptr_align specified in the PDB Alignment extra block was 4. There is no limit on the number of different type_names which can be declared to be PDBpointer, but all of the corresponding +define statements must be identical.

   "+" "PDBcast" type_name "{" member_pair ("," member_pair)* "}"
   member_pair:
        cast_member "," type_member
   cast_member:
        identifier
   type_member:
        identifier

The PDBcast extension handles the information in the PDB Casts extra block for PDB-style derived classes. Given the clumsy nature of the PDBpointer public_extension, this does not really work very well...